﻿WEBVTT

00:00:10.729 --> 00:00:12.896
- Okay, let's get started.

00:00:16.381 --> 00:00:21.529
Okay, so today we're going to get into some of
the details about how we train neural networks.

00:00:23.166 --> 00:00:28.785
So, some administrative details first.
Assignment 1 is due today, Thursday,

00:00:28.785 --> 00:00:36.521
so 11:59 p.m. tonight on Canvas. We're also
going to be releasing Assignment 2 today,

00:00:36.521 --> 00:00:40.082
and then your project proposals
are due Tuesday, April 25th.

00:00:40.082 --> 00:00:46.591
So you should be really starting to think about
your projects now if you haven't already.

00:00:46.591 --> 00:00:54.804
How many people have decided what they want to do for
their project so far? Okay, so some, some people,

00:00:54.804 --> 00:01:03.937
so yeah, everyone else, you can go to TA office hours
if you want suggestions and bounce ideas off of TAs.

00:01:05.657 --> 00:01:18.121
We also have a list of projects that other people, usually affiliated with
Stanford, have proposed on Piazza, so you can take a look at those for additional ideas.

00:01:19.604 --> 00:01:28.004
And we also have some notes on backprop for a linear layer and
on vector and tensor derivatives that Justin's written up,

00:01:28.004 --> 00:01:33.964
so that should help with understanding how exactly
backprop works for vectors and matrices.

00:01:33.964 --> 00:01:40.484
So these are linked to lecture four on the
syllabus and you can go and take a look at those.

00:01:45.110 --> 00:01:57.124
Okay, so where we are now. We've talked about how to express a function in terms of a
computational graph, that we can represent any function in terms of a computational graph.

00:01:57.124 --> 00:02:03.751
And we've talked more explicitly about neural networks,
which is a type of graph where we have these linear layers

00:02:03.751 --> 00:02:08.360
that we stack on top of each other
with nonlinearities in between.

00:02:09.456 --> 00:02:13.360
And we've also talked last lecture
about convolutional neural networks,

00:02:13.360 --> 00:02:24.936
which are a particular type of network that uses convolutional layers to
preserve the spatial structure throughout the hierarchy of the network.

00:02:24.936 --> 00:02:38.056
And so we saw exactly how a convolution layer looked, where each activation map in the convolutional
layer output is produced by sliding a filter of weights over all of the spatial locations in the input.

00:02:38.056 --> 00:02:45.456
And we also saw that usually we can have many filters per
layer, each of which produces a separate activation map.

00:02:45.456 --> 00:02:50.655
And so from an input with a certain depth,
we'll get an activation map output,

00:02:50.655 --> 00:02:58.771
which has some spatial dimension that's preserved, and whose
depth is the total number of filters that we have in that layer.

00:02:59.695 --> 00:03:05.895
And so what we want to do is we want to learn the
values of all of these weights or parameters,

00:03:05.895 --> 00:03:12.507
and we saw that we can learn our network parameters through optimization,
which we talked about a little bit earlier in the course, right?

00:03:12.507 --> 00:03:17.254
And so we want to get to a point in the
loss landscape that produces a low loss,

00:03:17.254 --> 00:03:23.053
and we can do this by taking steps
in the direction of the negative gradient.

00:03:23.053 --> 00:03:27.614
And so the whole process is what we call
Mini-batch Stochastic Gradient Descent,

00:03:27.614 --> 00:03:38.585
where we repeatedly sample a batch of data. We forward prop
it through our computational graph or our neural network. We get the loss at the end.

00:03:38.585 --> 00:03:41.960
We backprop through our network
to calculate the gradients.

00:03:41.960 --> 00:03:47.986
And then we update the parameters or the
weights in our network using this gradient.
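That loop can be sketched in a few lines of NumPy. This is a toy sketch only: a linear model stands in for a real network, and all names and values are illustrative, not from the lecture.

```python
import numpy as np

# Toy mini-batch SGD: a linear model y = x . w stands in for a network.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)          # parameters to learn
lr = 0.1                 # learning rate
for step in range(200):
    # 1. Sample a mini-batch of data.
    idx = rng.integers(0, len(X), size=32)
    xb, yb = X[idx], y[idx]
    # 2. Forward prop: predictions, then the loss (mean squared error).
    pred = xb @ w
    loss = np.mean((pred - yb) ** 2)
    # 3. Backprop: gradient of the loss with respect to the weights.
    grad = 2.0 * xb.T @ (pred - yb) / len(xb)
    # 4. Update the parameters in the negative gradient direction.
    w -= lr * grad
```

After a couple hundred of these steps, w lands very close to true_w.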

00:03:49.980 --> 00:03:58.321
Okay, so now for the next couple of lectures we're going to talk
about some of the details involved in training neural networks.

00:03:58.321 --> 00:04:02.441
And so this involves things like how do we
set up our neural network at the beginning,

00:04:02.441 --> 00:04:11.015
which activation functions that we choose, how do we preprocess the
data, weight initialization, regularization, gradient checking.

00:04:11.015 --> 00:04:16.118
We'll also talk about training dynamics. So,
how do we babysit the learning process?

00:04:16.118 --> 00:04:21.294
How do we choose how we do parameter
updates, specific parameter update rules,

00:04:21.294 --> 00:04:26.241
and how do we do hyperparameter optimization
to choose the best hyperparameters?

00:04:26.241 --> 00:04:28.281
And then we'll also talk about evaluation

00:04:28.281 --> 00:04:29.948
and model ensembles.

00:04:33.000 --> 00:04:41.015
So today in the first part, I will talk about activation functions,
data preprocessing, weight initialization, batch normalization,

00:04:41.015 --> 00:04:45.412
babysitting the learning process,
and hyperparameter optimization.

00:04:47.348 --> 00:04:50.348
Okay, so first activation functions.

00:04:51.708 --> 00:04:55.095
So, we saw earlier how out
of any particular layer,

00:04:55.095 --> 00:05:01.481
we have the data coming in, and we multiply by our weights
in, you know, a fully connected or a convolutional layer.

00:05:01.481 --> 00:05:06.388
And then we'll pass this through
an activation function or nonlinearity.

00:05:06.388 --> 00:05:08.027
And we saw some examples of this.

00:05:08.027 --> 00:05:13.295
We used sigmoid previously in some of our
examples. We also saw the ReLU nonlinearity.

00:05:13.295 --> 00:05:20.479
And so today we'll talk more about different choices for
these different nonlinearities and trade-offs between them.

00:05:22.228 --> 00:05:27.241
So first, the sigmoid, which we've seen before, and
probably the one we're most comfortable with, right?

00:05:27.241 --> 00:05:32.572
So the sigmoid function is as we have up
here, one over one plus e to the negative x.

00:05:32.572 --> 00:05:45.201
And what this does is take each number that's input into the sigmoid nonlinearity, so each
element, and elementwise squash it into this range [0,1], right, using this function here.
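As a quick sketch of that elementwise squashing:

```python
import numpy as np

def sigmoid(x):
    # 1 / (1 + e^(-x)), applied elementwise; outputs lie in (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))
# very negative -> near 0, zero -> 0.5, very positive -> near 1
```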

00:05:45.201 --> 00:05:50.427
And so, if you get very high values as input,
then the output is going to be something near one.

00:05:50.427 --> 00:05:55.321
If you get very low values, or, I'm sorry, very
negative values, it's going to be near zero.

00:05:55.321 --> 00:06:02.481
And then we have this regime near zero that it's in a
linear regime. It looks a bit like a linear function.

00:06:02.481 --> 00:06:05.374
And so this has been historically popular,

00:06:05.374 --> 00:06:11.530
because sigmoids, in a sense, you can interpret them as
a kind of a saturating firing rate of a neuron, right?

00:06:11.530 --> 00:06:15.455
So if it's something between zero and one,
you could think of it as a firing rate.

00:06:15.455 --> 00:06:23.588
And we'll talk later about other nonlinearities, like ReLUs that,
in practice, actually turned out to be more biologically plausible,

00:06:23.588 --> 00:06:27.402
but this does have a kind of
interpretation that you could make.

00:06:30.015 --> 00:06:36.492
So if we look at this nonlinearity more carefully, there's
several problems that there actually are with this.

00:06:36.492 --> 00:06:44.065
So the first is that saturated neurons can kill off
the gradient. And so what exactly does this mean?

00:06:44.988 --> 00:06:48.801
So if we look at a sigmoid gate right,
a node in our computational graph,

00:06:48.801 --> 00:06:54.566
and we have our data X as input into it, and then we
have the output of the sigmoid gate coming out of it,

00:06:54.566 --> 00:06:59.236
what does the gradient flow look like
as we're coming back?

00:06:59.236 --> 00:07:08.441
We have dL over d sigma right? The upstream gradient coming
down, and then we're going to multiply this by dSigma over dX.

00:07:08.441 --> 00:07:11.081
This will be the gradient
of a local sigmoid function.

00:07:11.081 --> 00:07:16.495
And we're going to chain these together for
our downstream gradient that we pass back.

00:07:16.495 --> 00:07:24.708
So who can tell me what happens when X is equal to -10?
It's very negative. What does its gradient look like?

00:07:24.708 --> 00:07:28.868
Zero, yeah, so that's right.
So the gradient becomes zero,

00:07:28.868 --> 00:07:37.348
and that's because in this negative, very negative region of
the sigmoid, it's essentially flat, so the gradient is zero,

00:07:37.348 --> 00:07:40.001
and we chain any upstream
gradient coming down.

00:07:40.001 --> 00:07:46.501
We multiply by basically something near zero, and we're going to
get a very small gradient that's flowing back downwards, right?

00:07:46.501 --> 00:07:55.381
So, in a sense, after the chain rule, this kills the gradient flow and
you're going to have a zero gradient passed down to downstream nodes.

00:07:58.869 --> 00:08:10.015
And so what happens when X is equal to zero? So there it's,
yeah, it's fine in this regime. So, in this regime near zero,

00:08:10.015 --> 00:08:15.135
you're going to get a reasonable gradient
here, and then it'll be fine for backprop.

00:08:15.135 --> 00:08:20.055
And then what about X equals 10?
Zero, right.

00:08:20.055 --> 00:08:31.108
So again, so when X is equal to a very negative or X is equal to large positive numbers, then
these are all regions where the sigmoid function is flat, and it's going to kill off the gradient

00:08:31.108 --> 00:08:35.275
and you're not going to get
a gradient flow coming back.
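You can verify this numerically. The sigmoid's local gradient is dSigma over dX = sigma(x) times (1 - sigma(x)), a standard identity used here as a sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Local gradient dsigma/dx = sigma(x) * (1 - sigma(x)).
    s = sigmoid(x)
    return s * (1.0 - s)

# In the flat (saturated) regions the local gradient is ~0, so any
# upstream gradient multiplied through the chain rule is killed.
print(sigmoid_grad(-10.0))  # ~4.5e-05: gradient killed
print(sigmoid_grad(0.0))    # 0.25, the maximum: fine for backprop
print(sigmoid_grad(10.0))   # ~4.5e-05: gradient killed
```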

00:08:37.055 --> 00:08:42.454
Okay, so a second problem is that
the sigmoid outputs are not zero centered.

00:08:42.454 --> 00:08:46.415
And so let's take a look
at why this is a problem.

00:08:46.415 --> 00:08:51.892
So, consider what happens when
the input to a neuron is always positive.

00:08:51.892 --> 00:08:54.948
So in this case, all of our Xs
we're going to say is positive.

00:08:54.948 --> 00:09:04.348
It's going to be multiplied by some weight, W, and then
we're going to run it through our activation function.

00:09:04.348 --> 00:09:08.015
So what can we say about
the gradients on W?

00:09:12.375 --> 00:09:18.135
So think about what the local gradient is
going to be, right, for this linear layer.

00:09:18.135 --> 00:09:24.214
We have dL over dF, the gradient
of the loss coming down,

00:09:24.214 --> 00:09:29.834
and then we have our local gradient,
which is going to be basically X, right?

00:09:29.834 --> 00:09:34.001
And so what does this mean,
if all of X is positive?

00:09:36.253 --> 00:09:44.401
Okay, so I heard it's always going to be positive. So that's almost right.
They're always going to be either all positive or all negative, right?

00:09:44.401 --> 00:09:53.588
So, our upstream gradient coming down from our loss L is going to be
dL over dF, and this is going to be either positive or negative.

00:09:53.588 --> 00:09:55.815
It's some arbitrary gradient coming down.

00:09:55.815 --> 00:10:06.619
And then our local gradient that we multiply this by, if we're going to
find the gradients on W, is going to be dF over dW, which is going to be X.

00:10:07.880 --> 00:10:20.800
And if X is always positive, then the gradients on W, which multiply these two
together, are always going to have the same sign as the upstream gradient coming down.

00:10:20.800 --> 00:10:28.520
And so what this means is that all the gradients of W, since they're always
either positive or negative, they're always going to move in the same direction.

00:10:28.520 --> 00:10:42.467
When you do a parameter update, you're either going to increase all of the values
of W by positive amounts, or differing positive amounts, or you will decrease them all.
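A tiny numeric illustration of this sign constraint, with hypothetical values:

```python
import numpy as np

# For a neuron f = w . x, the local gradient df/dw is just x.
# If every element of x is positive, every component of
# dL/dw = (dL/df) * x shares the sign of the upstream scalar dL/df.
x = np.array([0.5, 2.0, 1.3])      # all-positive inputs
for upstream in (+3.0, -0.7):      # two hypothetical upstream gradients
    grad_w = upstream * x
    print(np.sign(grad_w))         # all +1s or all -1s, never mixed
```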

00:10:42.467 --> 00:10:48.867
And so the problem with this is that this
gives very inefficient gradient updates.

00:10:48.867 --> 00:10:59.507
So, if you look at on the right here, we have an example of a case
where, let's say W is two-dimensional, so we have our two axes for W,

00:10:59.507 --> 00:11:04.796
and if we say that we can only have
all positive or all negative updates,

00:11:04.796 --> 00:11:12.400
then we have these two quadrants, the two places
where the updates are either all positive or all negative,

00:11:12.400 --> 00:11:17.213
and these are the only directions in which
we're allowed to make a gradient update.

00:11:17.213 --> 00:11:25.399
And so in the case where, let's say our hypothetical
optimal W is actually this blue vector here, right,

00:11:25.399 --> 00:11:30.773
and we're starting off at, you know, some point
at the beginning of the red arrows,

00:11:30.773 --> 00:11:38.946
we can't just directly take a gradient update in this direction,
because this is not in one of those two allowed gradient directions.

00:11:38.946 --> 00:11:43.479
And so what we're going to have to do, is we'll
have to take a sequence of gradient updates.

00:11:43.479 --> 00:11:51.953
For example, in these red arrow directions that are each in
allowed directions, in order to finally get to this optimal W.

00:11:53.039 --> 00:11:58.479
And so this is why also, in general,
we want zero-mean data.

00:11:58.479 --> 00:12:11.893
So, we want our input X to be zero-meaned, so that we actually have positive and negative
values and we don't get into this problem where the gradient updates all move in the same direction.

00:12:11.893 --> 00:12:17.819
So is this clear? Any questions
on this point? Okay.

00:12:21.453 --> 00:12:24.930
Okay, so we've talked about these two
main problems of the sigmoid.

00:12:24.930 --> 00:12:30.586
The saturated neurons can kill the gradients if
we're too positive or too negative of an input.

00:12:30.586 --> 00:12:36.586
They're also not zero-centered, and so we get
this inefficient kind of gradient update.

00:12:36.586 --> 00:12:43.146
And then a third problem, we have an exponential function
in here, so this is a little bit computationally expensive.

00:12:43.146 --> 00:12:46.837
In the grand scheme of your network,
this is usually not the main problem,

00:12:46.837 --> 00:12:51.186
because we have all these convolutions and
dot products that are a lot more expensive,

00:12:51.186 --> 00:12:55.103
but this is just a minor
point also to observe.

00:12:58.986 --> 00:13:03.166
So now we can look at a second
activation function here at tanh.

00:13:03.166 --> 00:13:10.999
And so this looks very similar to the sigmoid, but the
difference is that now it's squashing to the range [-1, 1].

00:13:10.999 --> 00:13:15.573
So here, the main difference
is that it's now zero-centered,

00:13:15.573 --> 00:13:21.306
so we've gotten rid of the second problem that we had. It
still kills the gradients, however, when it's saturated.

00:13:21.306 --> 00:13:29.264
So, you still have these regimes where the gradient is
essentially flat and you're going to kill the gradient flow.

00:13:29.264 --> 00:13:34.009
So this is a bit better than the sigmoid,
but it still has some problems.
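Both properties are easy to check numerically:

```python
import numpy as np

out = np.tanh(np.array([-10.0, 0.0, 10.0]))
print(out)  # squashed into (-1, 1), zero-centered: tanh(0) == 0
# But it still saturates: for large |x| the curve is flat,
# so the gradient there is essentially zero.
```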

00:13:36.586 --> 00:13:40.104
Okay, so now let's look at
the ReLU activation function.

00:13:40.104 --> 00:13:47.573
And this is one that we saw in our examples last lecture
when we were talking about the convolutional neural network.

00:13:47.573 --> 00:13:53.279
And we saw that we interspersed ReLU nonlinearities
between many of the convolutional layers.

00:13:53.279 --> 00:13:58.253
And so, this function is f of
x equals max of zero and x.

00:13:58.253 --> 00:14:06.573
So it applies an elementwise operation to your input, and basically
if your input is negative, it's going to set it to zero.

00:14:06.573 --> 00:14:13.264
And then if it's positive, it's going to
be just passed through. It's the identity.
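As a sketch:

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x), elementwise: negatives clamp to zero,
    # positives pass through unchanged (the identity).
    return np.maximum(0.0, x)

print(relu(np.array([-3.0, -0.5, 0.0, 2.0])))  # [0. 0. 0. 2.]
```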

00:14:13.264 --> 00:14:22.892
And so this is one that's pretty commonly used, and if we look at this one and
think about the problems that we saw earlier with the sigmoid and the tanh,

00:14:22.892 --> 00:14:26.746
we can see that it doesn't saturate
in the positive region.

00:14:26.746 --> 00:14:34.465
So there's a whole half of our input space where it's
not going to saturate, so this is a big advantage.

00:14:34.465 --> 00:14:36.959
So this is also
computationally very efficient.

00:14:36.959 --> 00:14:42.466
We saw earlier that the sigmoid
has this exponential in it.

00:14:42.466 --> 00:14:48.968
And so the ReLU is just this simple max,
so it's extremely fast.

00:14:48.968 --> 00:14:57.063
And in practice, using this ReLU, it converges much faster
than the sigmoid and the tanh, so about six times faster.

00:14:57.063 --> 00:15:01.090
And it's also turned out to be more
biologically plausible than the sigmoid.

00:15:01.090 --> 00:15:11.450
So if you look at a neuron and you look at what the inputs look like, and you look at
what the outputs look like, and you try to measure this in neuroscience experiments,

00:15:11.450 --> 00:15:18.303
you'll see that this one is actually a closer
approximation to what's happening than sigmoids.

00:15:18.303 --> 00:15:33.798
And so ReLUs were starting to be used a lot around 2012 when we had AlexNet, the first major convolutional neural
network that was able to do well on ImageNet and large-scale data. They used the ReLU in their experiments.

00:15:36.775 --> 00:15:42.082
So a problem, however, with the ReLU
is that it's not zero-centered anymore.

00:15:42.082 --> 00:15:49.228
So we saw that the sigmoid was not zero-centered.
Tanh fixed this and now ReLU has this problem again.

00:15:49.228 --> 00:15:52.122
And so that's one of
the issues of the ReLU.

00:15:52.122 --> 00:15:55.357
And then we also have
this further annoyance of,

00:15:55.357 --> 00:16:04.222
again, we saw that in the positive half of the inputs we don't
have saturation, but this is not the case in the negative half.

00:16:04.222 --> 00:16:06.882
Right, so just thinking about this
a little bit more precisely.

00:16:06.882 --> 00:16:11.255
So what's happening here
when X equals negative 10?

00:16:11.255 --> 00:16:12.855
So zero gradient, that's right.

00:16:12.855 --> 00:16:16.522
What happens when X is
equal to positive 10?

00:16:17.455 --> 00:16:20.175
It's good, right.
So, we're in the linear regime.

00:16:20.175 --> 00:16:30.442
And then what happens when X is equal to zero? Yes, it's undefined
here, but in practice, we'll say, you know, zero, right.

00:16:30.442 --> 00:16:35.074
And so basically, it's killing the
gradient in half of the regime.

00:16:37.948 --> 00:16:45.708
And so we can get this phenomenon of basically dead
ReLUs, when we're in this bad part of the regime.

00:16:45.708 --> 00:16:51.212
And so you can look at this as
coming from several potential reasons.

00:16:51.212 --> 00:16:57.192
And so if we look at our data cloud here,
this is all of our training data,

00:16:59.033 --> 00:17:09.092
then if we look at where the ReLUs can fall, each of these is
basically the half of the plane where it's going to activate.

00:17:11.948 --> 00:17:15.640
And so each of these is the plane
that defines each of these ReLUs,

00:17:15.640 --> 00:17:21.201
and we can see that you can have these dead
ReLUs that are basically off of the data cloud.

00:17:21.201 --> 00:17:26.588
And in this case, it will never activate and
never update, as compared to an active ReLU

00:17:26.588 --> 00:17:31.732
where some of the data is going to be positive
and passed through and some won't be.

00:17:31.732 --> 00:17:33.480
And so there's several reasons for this.

00:17:33.480 --> 00:17:37.201
The first is that it can happen
when you have bad initialization.

00:17:37.201 --> 00:17:45.015
So if you have weights that happen to be unlucky and they happen to be
off the data cloud, so they happen to specify this bad ReLU over here.

00:17:45.015 --> 00:17:55.069
Then they're never going to get a data input that causes it to activate,
and so they're never going to get good gradient flow coming back.

00:17:56.108 --> 00:17:59.321
And so it'll just never
update and never activate.

00:17:59.321 --> 00:18:03.880
What's the more common case is
when your learning rate is too high.

00:18:03.880 --> 00:18:11.561
And so in this case you started off with an okay ReLU, but because
you're making these huge updates, the weights jump around,

00:18:11.561 --> 00:18:18.028
and then your ReLU unit in a sense, gets knocked off of
the data manifold. And so this happens through training.

00:18:18.028 --> 00:18:22.975
So it was fine at the beginning and then
at some point, it became bad and it died.

00:18:22.975 --> 00:18:24.108
And so, in practice,

00:18:24.108 --> 00:18:33.361
if you freeze a network that you've trained and you pass the data through, you
can see that actually as much as 10 to 20% of the network is these dead ReLUs.

00:18:33.361 --> 00:18:40.001
And so you know that's a problem, but also most networks
do have this type of problem when you use ReLUs.

00:18:40.001 --> 00:18:49.467
Some of them will be dead, and in practice people look into this, and it's a
research problem, but networks still do okay in training despite this.

00:18:49.467 --> 00:18:51.268
Yeah, is there a question?

00:18:51.268 --> 00:18:54.851
[student speaking off mic]

00:19:01.908 --> 00:19:05.335
Right. So the question is, yeah, so the
data cloud is just your training data.

00:19:05.335 --> 00:19:08.918
[student speaking off mic]

00:19:17.641 --> 00:19:25.708
Okay, so the question is when, how do you tell when the ReLU
is going to be dead or not, with respect to the data cloud?

00:19:25.708 --> 00:19:30.988
And so if you look at, this is an example
of like a simple two-dimensional case.

00:19:30.988 --> 00:19:42.278
And so our ReLU, we're going to get our input to the ReLU, which is going
to be basically, you know, W1 X1 plus W2 X2, and we apply this,

00:19:42.278 --> 00:19:46.080
so that defines this
separating hyperplane here,

00:19:46.080 --> 00:19:51.453
and then we're going to take half of it that's going to
be positive, and half of it's going to be killed off,

00:19:51.453 --> 00:20:03.789
and so yes, it's whatever the weights happen to be, and where the data
happens to be, that determines where these hyperplanes fall, and so,

00:20:05.560 --> 00:20:14.329
so yeah so just throughout the course of training, some of your
ReLUs will be in different places, with respect to the data cloud.

00:20:16.480 --> 00:20:18.050
Oh, question.

00:20:18.050 --> 00:20:21.633
[student speaking off mic]

00:20:23.380 --> 00:20:33.478
Yeah. So okay, so the question is for the sigmoid we talked about two
drawbacks, and one of them was that the neurons can get saturated,

00:20:37.045 --> 00:20:40.500
so let's go back to the sigmoid here,

00:20:40.500 --> 00:20:45.820
and the question was this is not the case,
when all of your inputs are positive.

00:20:45.820 --> 00:20:51.971
So when all of your inputs are positive, they're all
going to be coming in in this zero plus region here,

00:20:51.971 --> 00:20:54.464
and so you can still
get a saturating neuron,

00:20:54.464 --> 00:21:00.544
because you see up in this positive
region, it also plateaus at one,

00:21:00.544 --> 00:21:08.846
and so when you have large positive values as input, you're also
going to get the zero gradient, because you have a flat slope here.

00:21:10.715 --> 00:21:11.548
Okay.

00:21:16.355 --> 00:21:24.528
Okay, so in practice people also like to
initialize ReLUs with slightly positive biases,

00:21:24.528 --> 00:21:30.721
in order to increase the likelihood of it being
active at initialization and to get some updates.

00:21:30.721 --> 00:21:40.430
Right, and so this basically just biases towards more ReLUs firing at the
beginning, and in practice some say that it helps; some say that it doesn't.

00:21:40.430 --> 00:21:48.072
Generally people don't always use this. Yeah, a lot of
times people just initialize it with zero biases still.

00:21:49.483 --> 00:21:54.777
Okay, so now we can look at some modifications
on the ReLU that have come out since then,

00:21:54.777 --> 00:21:57.768
and so one example is this leaky ReLU.

00:21:57.768 --> 00:22:04.429
And so this looks very similar to the original ReLU, and the only
difference is that now instead of being flat in the negative regime,

00:22:04.429 --> 00:22:11.955
we're going to give a slight negative slope here. And so this
solves a lot of the problems that we mentioned earlier.

00:22:11.955 --> 00:22:17.142
Right here we don't have any saturating
regime, even in the negative space.

00:22:17.142 --> 00:22:23.968
It's still very computationally efficient. It still converges
faster than sigmoid and tanh, very similar to a ReLU.

00:22:23.968 --> 00:22:27.218
And it doesn't have this dying problem.
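A sketch of the leaky ReLU; the 0.01 slope is the commonly used default, assumed here rather than taken from the lecture:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Like ReLU, but with a small slope alpha in the negative regime,
    # so the local gradient is alpha there instead of zero: the unit
    # can't die the way a plain ReLU can.
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-10.0, 10.0])))  # small negative, not clamped to 0
```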

00:22:28.923 --> 00:22:35.380
And there's also another example
is the parametric rectifier, so PReLU.

00:22:35.380 --> 00:22:42.195
And so in this case it's just like a leaky ReLU where
we again have this sloped region in the negative space,

00:22:42.195 --> 00:22:47.088
but now this slope in the negative regime
is determined through this alpha parameter,

00:22:47.088 --> 00:22:52.982
so we don't specify it, we don't hard-code it, but we treat
it as a parameter that we can backprop into and learn.

00:22:52.982 --> 00:22:57.555
And so this gives it a
little bit more flexibility.

00:22:57.555 --> 00:23:02.342
And we also have something called
an Exponential Linear Unit, an ELU,

00:23:02.342 --> 00:23:08.295
so we have all these different LUs,
basically. And this one again, you know,

00:23:08.295 --> 00:23:10.341
it has all the benefits of the ReLu,

00:23:10.341 --> 00:23:14.508
but now it also gives
outputs closer to zero mean.

00:23:16.181 --> 00:23:24.901
So, that's actually an advantage that the leaky ReLU, parametric ReLU,
a lot of these they allow you to have your mean closer to zero,

00:23:26.699 --> 00:23:36.538
but compared with the leaky ReLU, instead of it being sloped in the negative
regime, here you actually are building back in a negative saturation regime,

00:23:36.538 --> 00:23:43.029
and there's arguments that basically this allows
you to have some more robustness to noise,

00:23:43.029 --> 00:23:48.566
and you basically get these deactivation
states that can be more robust.

00:23:48.566 --> 00:23:55.885
And you can look at this paper; there's a lot more
justification in there for why this is the case.

00:23:55.885 --> 00:24:01.111
And in a sense this is kind of something
in between the ReLUs and the leaky ReLUs,

00:24:01.111 --> 00:24:13.267
where it has some of this shape, like the leaky ReLU does, which gives it closer to zero-mean
output, but then it also still has some of this more saturating behavior that ReLUs have.
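A sketch of the ELU's standard form, with the common choice alpha = 1 assumed:

```python
import numpy as np

def elu(x, alpha=1.0):
    # x for x > 0; alpha * (e^x - 1) for x <= 0.
    # Negative inputs saturate smoothly toward -alpha (not a fixed
    # slope), which pushes the mean activation closer to zero while
    # keeping a ReLU-like saturating regime on the negative side.
    return np.where(x > 0, x, alpha * np.expm1(x))

print(elu(np.array([-10.0, 0.0, 2.0])))  # approx [-1, 0, 2]
```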

00:24:13.267 --> 00:24:14.350
A question?

00:24:14.350 --> 00:24:17.933
[student speaking off mic]

00:24:19.952 --> 00:24:24.365
So, whether this parameter alpha
is going to be specific for each neuron.

00:24:24.365 --> 00:24:34.090
So, I believe it is often specified, but I actually can't remember exactly,
so you can look in the paper for exactly, yeah, how this is defined,

00:24:35.578 --> 00:24:45.050
but yeah, so I believe this function is basically very
carefully designed in order to have nice desirable properties.

00:24:45.050 --> 00:24:49.992
Okay, so there's basically all of these
kinds of variants on the ReLU.

00:24:49.992 --> 00:24:58.192
And so you can see that, for all of these, you can argue that
each one may have certain benefits and certain drawbacks in practice.

00:24:58.192 --> 00:25:04.950
People just want to run experiments with all of them, and see empirically
what works better, try and justify it, and come up with new ones,

00:25:04.950 --> 00:25:08.612
but they're all different things
that are being experimented with.

00:25:10.135 --> 00:25:14.744
And so let's just mention one more.
This is Maxout Neuron.

00:25:14.744 --> 00:25:25.969
So, this one looks a little bit different in that it doesn't have the same form as the others did
of taking your basic dot product and then putting this elementwise nonlinearity on top of it.

00:25:25.969 --> 00:25:34.670
Instead, it looks like this, this max of W dot product of X plus
B, and a second set of weights, W2 dot product with X plus B2.

00:25:38.230 --> 00:25:43.185
And so what this does is take the max
of these two functions, in a sense.

00:25:44.870 --> 00:25:48.949
And so what it does is it generalizes
the ReLU and the leaky ReLu,

00:25:48.949 --> 00:25:54.112
because you're just taking the max
over these two linear functions.

00:25:55.023 --> 00:26:02.927
And so what this gives us is that, again, you're operating in
a linear regime. It doesn't saturate and it doesn't die.

00:26:02.927 --> 00:26:15.984
The problem is that here you are doubling the number of parameters per neuron. So instead
of one original set of weights W, each neuron now has W1 and W2, so you have twice as many.
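A single maxout neuron can be sketched with illustrative random values:

```python
import numpy as np

# Maxout: max(w1 . x + b1, w2 . x + b2). ReLU is the special case
# w1 = 0, b1 = 0. All values below are illustrative only.
rng = np.random.default_rng(1)
x = rng.normal(size=4)
w1, b1 = rng.normal(size=4), 0.1
w2, b2 = rng.normal(size=4), -0.2

out = np.maximum(w1 @ x + b1, w2 @ x + b2)
# The cost: two (w, b) pairs per neuron instead of one.
```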

00:26:17.765 --> 00:26:24.560
So in practice, when we look at all of these activation
functions, kind of a good general rule of thumb is use ReLU.

00:26:24.560 --> 00:26:29.389
This is the most standard one
that generally just works well.

00:26:30.231 --> 00:26:36.497
And you know, you do want to be careful in general with your
learning rates, adjusting them based on how things do.

00:26:36.497 --> 00:26:40.091
We'll talk more about adjusting
learning rates later in this lecture,

00:26:40.091 --> 00:26:52.318
but you can also try out some of these fancier activation functions, the leaky ReLU,
Maxout, ELU, but these are generally still kind of more experimental.

00:26:53.828 --> 00:26:56.643
So, you can see how they
work for your problem.

00:26:56.643 --> 00:27:04.035
You can also try out tanh, but probably some of these
ReLU and ReLU variants are going to be better.

00:27:04.035 --> 00:27:15.243
And in general don't use sigmoid. This is one of the earliest original activation
functions, and ReLU and these other variants have generally worked better since then.

00:27:17.361 --> 00:27:21.517
Okay, so now let's talk a little bit
about data preprocessing.

00:27:21.517 --> 00:27:24.602
Right, so the activation function is something
we design as part of our network.

00:27:24.602 --> 00:27:30.361
Now we want to train the network, and we have our
input data that we want to start training from.

00:27:31.424 --> 00:27:39.495
So, generally we want to always preprocess the data, and this is something that
you've probably seen before in machine learning classes if you've taken those.

00:27:39.495 --> 00:27:49.366
And some standard types of preprocessing are: you take your original data,
you zero-mean it, and then you probably also want to normalize it,

00:27:49.366 --> 00:27:57.367
so normalize by the standard deviation.
And so why do we want to do this?

00:27:57.367 --> 00:28:04.979
For zero centering, you can remember that earlier we talked
about how when all the inputs are positive, for example,

00:28:04.979 --> 00:28:12.772
then we get all of our gradients on the weights to be
positive, and we get this basically suboptimal optimization.

00:28:12.772 --> 00:28:21.710
And in general, even if the inputs are not all positive or all
negative, any sort of bias will still cause this type of problem.

00:28:23.770 --> 00:28:36.440
And so then in terms of normalizing the data: in typical machine learning problems, you basically
want to normalize the data so that all features are in the same range, and so that they contribute equally.

00:28:36.440 --> 00:28:45.866
In practice, for images, which is what we're dealing with in
this course for the most part, we do do the zero centering,

00:28:45.866 --> 00:28:56.616
but in practice we don't actually normalize the pixel values so much, because generally for
images, at each location you already have relatively comparable scale and distribution,

00:28:56.616 --> 00:29:09.339
and so we don't really need to normalize so much, compared to more general machine learning problems,
where you might have different features that are very different and of very different scales.

00:29:11.037 --> 00:29:19.983
And in machine learning, you might also see more complicated
things, like PCA or whitening, but again with images,

00:29:19.983 --> 00:29:28.678
we typically just stick with the zero mean, and we don't do the normalization,
and we also don't do some of these more complicated pre-processing.

00:29:29.519 --> 00:29:40.876
And one reason for this is that generally with images, we don't really want to take all of our input, let's say pixel
values, and project them onto a lower-dimensional space of new kinds of features that we're dealing with.

00:29:40.876 --> 00:29:48.184
We typically just want to apply convolutional networks spatially
and have our spatial structure over the original image.

00:29:48.184 --> 00:29:49.595
Yeah, question.

00:29:49.595 --> 00:29:53.178
[student speaking off mic]

00:29:58.858 --> 00:30:06.968
So the question is we do this pre-processing in a training phase, do we
also do the same kind of thing in the test phase, and the answer is yes.

00:30:06.968 --> 00:30:24.839
So, let me just move to the next slide here. So, in general the training phase is where we determine our,
let's say, mean, and then we apply this exact same mean to the test data. So, we'll normalize by the same empirical mean from the training data.

00:30:24.839 --> 00:30:35.822
Okay, so to summarize basically for images, we typically just do the zero
mean pre-processing and we can subtract either the entire mean image.

00:30:38.151 --> 00:30:41.354
So, from the training data,
you compute the mean image,

00:30:41.354 --> 00:30:54.777
which will be the same size as each image. So, for example, 32 by 32 by three, you'll get this array
of numbers, and then you subtract that from each image that you're about to pass through the network,

00:30:54.777 --> 00:31:00.532
and you'll do the same thing at test time for
this array that you determined at training time.

00:31:00.532 --> 00:31:14.916
In practice, for some networks, we also do this by just subtracting a per-channel mean, and so
instead of having an entire mean image that we're going to zero-center by, we just take the mean by channel,

00:31:14.916 --> 00:31:25.718
and this is just because it turns out that it was similar enough across the whole image, it
didn't make such a big difference to subtract the mean image versus just a per-channel value.

00:31:25.718 --> 00:31:36.936
And this is easier to just pass around and deal with. So, you'll see this as well for example,
in a VGG Network, which is a network that came after AlexNet, and we'll talk about that later.
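
As a sketch of what this looks like in code (a minimal NumPy example; the array sizes and variable names are illustrative, not from the lecture):

```python
import numpy as np

np.random.seed(0)
# Illustrative data: 50 training and 10 test images of size 32x32x3.
X_train = np.random.uniform(0, 255, size=(50, 32, 32, 3))
X_test = np.random.uniform(0, 255, size=(10, 32, 32, 3))

# Option 1: subtract the full mean image (shape 32x32x3), computed on
# the training set only, and reuse that same mean at test time.
mean_image = X_train.mean(axis=0)
X_train_centered = X_train - mean_image
X_test_centered = X_test - mean_image

# Option 2 (e.g. VGG-style): subtract one mean per RGB channel, three numbers.
per_channel_mean = X_train.mean(axis=(0, 1, 2))
X_train_pc = X_train - per_channel_mean
```

Either way, the statistics come from the training data only, and the test set is centered with that same training-set mean.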

00:31:36.936 --> 00:31:38.545
Question.

00:31:38.545 --> 00:31:42.128
[student speaking off mic]

00:31:45.215 --> 00:31:52.049
Okay, so there are two questions. The first is what's a channel,
in this case, when we are subtracting a per-channel mean?

00:31:52.049 --> 00:32:04.198
And this is RGB, so our array, our images are typically for example, 32 by 32 by
three. So, width, height, each are 32, and our depth, we have three channels RGB,

00:32:04.198 --> 00:32:09.786
and so we'll have one mean for the red
channel, one mean for a green, one for blue.

00:32:09.786 --> 00:32:14.529
And then the second, what
was your second question?

00:32:14.529 --> 00:32:18.112
[student speaking off mic]

00:32:21.349 --> 00:32:26.827
Oh. Okay, so the question is when we're subtracting
the mean image, what is the mean taken over?

00:32:27.882 --> 00:32:39.114
And the mean is taken over all of your training images. So, you'll take all of your
training images and just compute the mean of all of those. Does that make sense?

00:32:39.114 --> 00:32:42.697
[student speaking off mic]

00:32:48.432 --> 00:32:55.255
Yeah, the question is whether we do this for the entire training set,
once before we start training, rather than per batch,

00:32:55.255 --> 00:32:57.904
and yeah, that's exactly correct.

00:32:57.904 --> 00:33:03.984
So we just want to have a good sample,
a good empirical mean.

00:33:03.984 --> 00:33:13.983
And so if you take it per batch, if you're sampling reasonable batches,
you should basically be getting the same values for the mean anyway,

00:33:13.983 --> 00:33:19.126
and so it's more efficient and easier
to just do this once at the beginning.

00:33:19.126 --> 00:33:28.296
You might not even have to really take it over the entire training data. You could
also just sample enough training images to get a good estimate of your mean.

00:33:30.734 --> 00:33:35.560
Okay, so any other questions
about data preprocessing? Yes.

00:33:35.560 --> 00:33:38.654
[student speaking off mic]

00:33:38.654 --> 00:33:42.187
So, the question is does the data
preprocessing solve the sigmoid problem?

00:33:42.187 --> 00:33:46.354
So the data preprocessing
is doing zero mean right?

00:33:47.540 --> 00:33:50.535
And we talked about how, for the sigmoid,
we want to have zero-mean inputs.

00:33:50.535 --> 00:33:56.262
And so it does solve this for the
first layer that we pass it through.

00:33:56.262 --> 00:34:00.263
So, now our inputs to the first layer
of our network are going to be zero mean,

00:34:00.263 --> 00:34:08.472
but we'll see later on that we're actually going to have this problem
come up in much worse form as we get to deep networks.

00:34:08.472 --> 00:34:12.437
You're going to get a lot
of nonzero mean problems later on.

00:34:12.438 --> 00:34:19.350
And so in this case, this is not going to be sufficient.
So this only helps at the first layer of your network.

00:34:21.784 --> 00:34:28.203
Okay, so now let's talk about how do we want
to initialize the weights of our network?

00:34:28.204 --> 00:34:34.471
So, we have let's say our standard two layer neural network
and we have all of these weights that we want to learn,

00:34:34.472 --> 00:34:43.509
but we have to start them with some value, right? And then we're
going to update them using our gradient updates from there.

00:34:43.510 --> 00:34:56.157
So first question. What happens when we use an initialization of W equals zero?
We just set all of the parameters to be zero. What's the problem with this?

00:34:56.157 --> 00:34:58.683
[student speaking off mic]

00:34:58.683 --> 00:35:00.766
So sorry, say that again.

00:35:02.039 --> 00:35:08.320
So I heard all the neurons are going to
be dead. No updates ever. So not exactly.

00:35:11.035 --> 00:35:16.995
So, part of that is correct in that all the neurons
will do the same thing. So, they might not all be dead.

00:35:16.995 --> 00:35:23.321
Depending on your input value, I mean, you could be in
any regime of your neurons, so they might not be dead,

00:35:23.321 --> 00:35:27.869
but the key thing is that they
will all do the same thing.

00:35:27.869 --> 00:35:36.577
So, since your weights are zero, given an input, every neuron is
basically going to have the same operation on top of your inputs.

00:35:36.577 --> 00:35:43.621
And so, since they're all going to output the same
thing, they're also all going to get the same gradient.

00:35:43.621 --> 00:35:47.571
And so, because of that, they're all
going to update in the same way.

00:35:47.571 --> 00:35:51.983
And now you're just going to get all neurons that
are exactly the same, which is not what you want.

00:35:51.983 --> 00:35:54.075
You want the neurons to
learn different things.

00:35:54.075 --> 00:35:58.514
And so, that's the problem
when you initialize everything equally

00:35:58.514 --> 00:36:02.730
and there's basically no
symmetry breaking here.

00:36:02.730 --> 00:36:05.961
So, what's the first, yeah question?

00:36:05.961 --> 00:36:09.544
[student speaking off mic]

00:36:19.699 --> 00:36:29.961
So the question is, because that, because the gradient also depends
on our loss, won't one backprop differently compared to the other?

00:36:29.961 --> 00:36:46.072
So in the last layer, yes, you do have some of this: the neurons will get
a different loss, based on which class each specific neuron was connected to,

00:36:46.072 --> 00:36:54.352
but if you look at all the neurons throughout your network, you basically
have a lot of these neurons that are connected in exactly the same way.

00:36:54.352 --> 00:36:59.885
They'll get the same updates, and that's
basically going to be the problem.

00:36:59.885 --> 00:37:10.885
Okay, so the first idea that we can have to try and improve upon this is to set all
of the weights to be small random numbers that we can sample from a distribution.

00:37:10.885 --> 00:37:16.002
So, in this case, we're going to sample
from basically a standard gaussian,

00:37:16.002 --> 00:37:22.450
but we're going to scale it so that the standard
deviation is actually one E negative two, 0.01.

00:37:22.450 --> 00:37:25.640
And so, just give this
many small random weights.
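
A minimal sketch of these two initializations in NumPy (the layer sizes are illustrative):

```python
import numpy as np

np.random.seed(0)
Din, Dout = 500, 500  # illustrative layer sizes

# Broken: with all-zero weights, every neuron computes the same output
# and receives the same gradient, so there is no symmetry breaking.
W_zero = np.zeros((Din, Dout))

# First idea: small random numbers, a standard Gaussian scaled to std 0.01.
W_small = 0.01 * np.random.randn(Din, Dout)
```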

00:37:25.640 --> 00:37:30.729
And so, this does work okay for small
networks, now we've broken the symmetry,

00:37:30.729 --> 00:37:34.896
but there's going to be
problems with deeper networks.

00:37:35.970 --> 00:37:43.070
And so, let's take a look at why this is the case. So,
here this is basically an experiment that we can do

00:37:43.070 --> 00:37:45.341
where let's take a deeper network.

00:37:45.341 --> 00:37:53.622
So in this case, let's initialize a 10 layer neural
network to have 500 neurons in each of these 10 layers.

00:37:53.622 --> 00:37:56.437
Okay, we'll use tanh
nonlinearities in this case

00:37:56.437 --> 00:38:06.116
and we'll initialize it with small random numbers as we described in the
last slide. So here, we're going to basically just initialize this network.

00:38:06.116 --> 00:38:12.356
We have random data that we're going to take, and
now let's just pass it through the entire network,

00:38:12.356 --> 00:38:18.725
and at each layer, look at the statistics of
the activations that come out of that layer.
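
The experiment described here can be reproduced with a short NumPy sketch (sizes follow the slide: 10 layers of 500 units, tanh, small random weights; the exact numbers are illustrative):

```python
import numpy as np

np.random.seed(0)
D = 500
x = np.random.randn(1000, D)  # random input data

layer_stds = []
for _ in range(10):
    W = 0.01 * np.random.randn(D, D)  # small-random initialization
    x = np.tanh(x.dot(W))             # forward through one tanh layer
    layer_stds.append(x.std())        # track activation statistics per layer

# The standard deviation shrinks layer by layer and collapses toward zero.
```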

00:38:22.476 --> 00:38:25.485
And so, what we'll see, and this is probably
a little bit hard to read up top,

00:38:25.485 --> 00:38:31.156
but if we compute the mean
and the standard deviations at each layer,

00:38:31.156 --> 00:38:39.410
we'll see that at the first layer,
the means are always around zero.

00:38:40.267 --> 00:38:48.219
There's a funny sound in here.
Interesting, okay well that was fixed.

00:38:49.613 --> 00:38:58.153
So, if we look at the outputs from here, the
mean is always going to be around zero, which makes sense.

00:38:58.153 --> 00:39:01.175
So, if we look here, let's see,

00:39:01.175 --> 00:39:11.420
if we take this, we looked at the dot product of X with W, and then
we took the tanh nonlinearity, and then we stored these values, and so,

00:39:12.315 --> 00:39:16.780
because tanh is centered around zero,
this makes sense,

00:39:16.780 --> 00:39:22.450
and then the standard deviation however
shrinks, and it quickly collapses to zero.

00:39:22.450 --> 00:39:32.019
So, if we're plotting this, here this second row of plots here is showing the
mean and standard deviations over time per layer and then in the bottom,

00:39:32.019 --> 00:39:38.592
the sequence of plots is showing, for each of our layers,
what the distribution of the activations is.

00:39:38.592 --> 00:39:45.206
And so, we can see that at the first layer, we still have a
reasonable gaussian looking thing. It's a nice distribution.

00:39:45.206 --> 00:39:58.591
But the problem is that as we multiply by this W, these small numbers at each layer, this
quickly shrinks and collapses all of these values, as we multiply this over and over again.

00:39:58.591 --> 00:40:02.191
And so, by the end, we
get all of these zeros,

00:40:02.191 --> 00:40:04.262
which is not what we want.

00:40:04.262 --> 00:40:07.457
So we get all the activations become zero.

00:40:07.457 --> 00:40:10.420
And so now let's think
about the backwards pass.

00:40:10.420 --> 00:40:16.144
So, if we do a backward pass, now assuming this was our
forward pass and now we want to compute our gradients.

00:40:16.144 --> 00:40:20.024
So first, what does the gradients
look like on the weights?

00:40:24.155 --> 00:40:26.238
Does anyone have a guess?

00:40:28.571 --> 00:40:36.531
So, if we think about this, we have our input
values are very small at each layer right,

00:40:36.531 --> 00:40:43.273
because they've all collapsed at this near zero, and then
now each layer, we have our upstream gradient flowing down,

00:40:43.273 --> 00:40:53.483
and then in order to get the gradient on the weights, remember it's our upstream gradient
times our local gradient, which for this dot product we're doing, W times X,

00:40:53.483 --> 00:40:56.985
It's just basically going to
be X, which is our inputs.

00:40:56.985 --> 00:41:00.571
So, it's again a similar kind of problem
that we saw earlier,

00:41:00.571 --> 00:41:07.058
where now, because X is small, our weights are getting a
very small gradient, and they're basically not updating.

00:41:07.058 --> 00:41:13.488
So, this is a way that you can basically try and think
about the effect of gradient flows through your networks.

00:41:13.488 --> 00:41:20.329
You can always think about what the forward pass is doing, and then
think about what's happening as you have gradient flows coming down,

00:41:20.329 --> 00:41:28.562
and different types of inputs, what the effect of this
actually is on our weights and the gradients on them.

00:41:28.562 --> 00:41:38.025
And so also, if now if we think about what's the gradient that's going to
be flowing back from each layer as we're chaining all these gradients.

00:41:40.004 --> 00:41:50.291
Alright, so this is going to be the flipped thing, where the gradient flowing back is our
upstream gradient times the local gradient, which in this case, with respect to our input X, is W.

00:41:50.291 --> 00:41:53.085
And so again, because
this is the dot product,

00:41:53.085 --> 00:42:06.208
and so now, actually going backwards at each layer, we're basically doing a multiplication
of the upstream gradient by our weights in order to get the next gradient flowing downwards.

00:42:07.283 --> 00:42:18.198
And so because here we're multiplying by W over and over again, you're getting basically the
same phenomenon as we had in the forward pass, where everything is getting smaller and smaller.

00:42:18.198 --> 00:42:23.541
And now the gradient, upstream gradients
are collapsing to zero as well.

00:42:23.541 --> 00:42:24.869
Question?

00:42:24.869 --> 00:42:28.452
[student speaking off mic]

00:42:30.731 --> 00:42:37.945
Yes, I guess upstream and downstream is, can be interpreted
differently, depending on if you're going forward and backward,

00:42:37.945 --> 00:42:43.907
but in this case we're going backwards,
right? We're doing backpropagation.

00:42:43.907 --> 00:42:51.409
And so upstream is the gradient flowing, you can think of
a flow from your loss, all the way back to your input.

00:42:51.409 --> 00:42:58.684
And so upstream is what came from what you've already
done, flowing down into your current node.

00:43:00.270 --> 00:43:07.521
Right, so we're flowing downwards, and what we get coming
into the node through backprop is coming from upstream.

00:43:13.888 --> 00:43:21.102
Okay, so now let's think about what happens when, you know we saw
that this was a problem when our weights were pretty small, right?

00:43:21.102 --> 00:43:26.133
So, we can think about well, what if we just
try and solve this by making our weights big?

00:43:26.133 --> 00:43:38.273
So, let's sample from this standard gaussian, now with standard deviation
one instead of 0.01. So what's the problem here? Does anyone have a guess?

00:43:44.558 --> 00:43:54.750
If our weights are now all big, and we're passing them, and we're taking
these outputs of W times X, and passing them through tanh nonlinearities,

00:43:54.750 --> 00:44:01.883
remember we were talking about what happens at different
values of inputs to tanh, so what's the problem?

00:44:01.883 --> 00:44:06.289
Okay, so yeah I heard that it's going
to be saturated, so that's right.

00:44:06.289 --> 00:44:15.966
Basically now, because our weights are going to be big, we're almost always
going to be in saturated regimes of the tanh, either very negative or very positive.

00:44:15.966 --> 00:44:29.695
And so in practice, what you're going to get here is now if we look at the distribution of the activations
at each of the layers here on the bottom, they're going to be all basically negative one or plus one.
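
Changing only the weight scale in the same sketch shows this opposite failure (again an illustrative experiment, not code from the lecture):

```python
import numpy as np

np.random.seed(0)
D = 500
x = np.random.randn(1000, D)
for _ in range(10):
    W = 1.0 * np.random.randn(D, D)  # weights too large: std 1.0 instead of 0.01
    x = np.tanh(x.dot(W))

# Almost every tanh output sits near -1 or +1, where the local gradient is ~0.
frac_saturated = np.mean(np.abs(x) > 0.99)
```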

00:44:30.855 --> 00:44:40.447
Right, and so this will have the problem that we talked about with the tanh earlier, when
they're saturated, that all the gradients will be zero, and our weights are not updating.

00:44:41.397 --> 00:44:46.363
So basically, it's really hard to get
your weight initialization right.

00:44:46.363 --> 00:44:50.296
When it's too small they all collapse.
When it's too large they saturate.

00:44:50.296 --> 00:44:55.553
So, there's been some work in trying to figure out well,
what's the proper way to initialize these weights.

00:44:55.553 --> 00:45:02.507
And so, one kind of good rule of thumb that
you can use is the Xavier initialization.

00:45:02.507 --> 00:45:07.388
And so this is from this
paper by Glorot in 2010.

00:45:07.388 --> 00:45:15.962
And so for this formula, if we look at W up here,
we can see how we want to initialize it:

00:45:17.403 --> 00:45:22.653
we sample from our standard gaussian, and then we're
going to scale by the number of inputs that we have.

00:45:22.653 --> 00:45:28.599
And you can go through the math, and you can see in the lecture
notes, as well as in this paper, exactly how this works out,

00:45:28.599 --> 00:45:35.789
but basically the way we do it is we specify that we want the
variance of the input to be the same as a variance of the output,

00:45:35.789 --> 00:45:42.789
and then if you derive what the weights should be, you'll get
this formula. And intuitively, what this kind of means is that

00:45:42.789 --> 00:45:52.654
if you have a small number of inputs right, then we're going to divide by the smaller
number and get larger weights, and we need larger weights, because with small inputs,

00:45:52.654 --> 00:45:58.993
and you're multiplying each of these by a weight, you need
larger weights to get the same larger variance at the output,

00:45:58.993 --> 00:46:08.505
and kind of vice versa for if we have many inputs, then we want
smaller weights in order to get the same spread at the output.

00:46:08.505 --> 00:46:10.795
So, you can look at the notes
for more details about this.

00:46:10.795 --> 00:46:23.150
And so basically now, if we want to have a unit gaussian as input to each
layer, we can use this kind of initialization at training time,

00:46:23.150 --> 00:46:27.669
so that there is approximately
a unit gaussian at each layer.
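
A sketch of the same layer-by-layer experiment with Xavier initialization, scaling a standard Gaussian by one over the square root of the number of inputs (sizes illustrative):

```python
import numpy as np

np.random.seed(0)
D = 500
x = np.random.randn(1000, D)
for _ in range(10):
    # Xavier: standard Gaussian scaled by 1/sqrt(fan_in), so the
    # output variance roughly matches the input variance.
    W = np.random.randn(D, D) / np.sqrt(D)
    x = np.tanh(x.dot(W))

# Activations keep a reasonable spread instead of collapsing or saturating.
```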

00:46:29.057 --> 00:46:35.032
Okay, and so one thing this does assume,
though, is that there are linear activations,

00:46:35.032 --> 00:46:40.837
so it assumes that we are in the
active region of the tanh, for example.

00:46:40.837 --> 00:46:46.051
And so again, you can look at the notes to
really try and understand its derivation,

00:46:46.051 --> 00:46:51.255
but the problem is that this breaks
when now you use something like a ReLU.

00:46:51.255 --> 00:46:54.849
Right, and so with the
ReLU what happens is that,

00:46:54.849 --> 00:47:04.685
because it's killing half of your units, it's setting approximately half of them
to zero at each time, it's actually halving the variance that you get out of this.

00:47:04.685 --> 00:47:16.193
And so now, if you just make the same assumptions as in the derivation earlier,
you won't actually get the right variance coming out; it's going to be too small.

00:47:16.193 --> 00:47:23.323
And so what you see is again this kind of
phenomenon, as the distributions start collapsing.

00:47:23.323 --> 00:47:28.019
In this case you get more and more peaked
toward zero, and more units deactivated.

00:47:29.541 --> 00:47:41.580
And the way to address this is something that has been pointed out in some
papers, which is that you can try to account for this with an extra divide by two.

00:47:41.580 --> 00:47:47.023
So, now you're basically adjusting for the
fact that half the neurons get killed.

00:47:48.636 --> 00:47:58.122
And so your effective input actually has half this number of inputs,
so you just add this divide-by-two factor in, and this works much better,

00:47:59.332 --> 00:48:05.348
and you can see that the distributions are pretty
good throughout all layers of the network.
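
With ReLU, the extra factor of two goes under the square root; a sketch of this He-style initialization (sizes illustrative):

```python
import numpy as np

np.random.seed(0)
D = 500
x = np.random.randn(1000, D)
for _ in range(10):
    # He initialization: sqrt(2 / fan_in) compensates for ReLU
    # zeroing out roughly half of the units at each layer.
    W = np.random.randn(D, D) * np.sqrt(2.0 / D)
    x = np.maximum(0, x.dot(W))

# Activation spread stays healthy; with plain Xavier it would keep shrinking.
```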

00:48:06.959 --> 00:48:16.161
And so in practice this has actually been really important for training these types of
networks; really paying attention to how your weights are initialized can make a big difference.

00:48:16.161 --> 00:48:28.309
And so for example, you'll see in some papers that this actually is the difference
between the network even training at all and performing well versus nothing happening.

00:48:32.548 --> 00:48:36.321
So, proper initialization is still
an active area of research.

00:48:36.321 --> 00:48:40.281
And so if you're interested in this, you can
look at a lot of these papers and resources.

00:48:40.281 --> 00:48:51.701
A good general rule of thumb is basically use the Xavier Initialization to start
with, and then you can also think about some of these other kinds of methods.

00:48:53.871 --> 00:49:01.405
And so now we're going to talk about a related idea to this, the
idea of wanting to keep activations in the gaussian range that we want.

00:49:03.330 --> 00:49:09.672
Right, and so this idea behind what we're going to call batch
normalization is, okay we want unit gaussian activations.

00:49:09.672 --> 00:49:14.240
Let's just make them that way.
Let's just force them to be that way.

00:49:14.240 --> 00:49:15.834
And so how does this work?

00:49:15.834 --> 00:49:25.640
So, let's consider a batch of activations at some layer. And so now we have
all of our activations coming out. If we want to make this unit gaussian,

00:49:25.640 --> 00:49:29.368
we actually can just do
this empirically, right.

00:49:29.368 --> 00:49:39.392
We can take the mean of the current batch, and
the variance, and we can just normalize by these.

00:49:39.392 --> 00:49:50.867
Right, and so basically, with weight initialization, we're setting this at the start of
training so that we try to get into a good spot, so that we can have unit gaussians at every layer,

00:49:50.867 --> 00:49:53.096
hoping that during training
this will be preserved.

00:49:53.096 --> 00:49:58.336
Now we're going to explicitly make that happen
on every forward pass through the network.

00:49:58.336 --> 00:50:06.787
We're going to make this happen functionally, and basically
by normalizing by the mean and the variance of each neuron,

00:50:08.139 --> 00:50:15.754
we look at all of the inputs coming into it, calculate the
mean and variance for that batch, and normalize by them.

00:50:15.754 --> 00:50:19.928
And the thing is that this is just
a differentiable function, right?

00:50:19.928 --> 00:50:31.098
If we have our mean and our variance as constants, this is just a sequence of
computational operations that we can differentiate and do back prop through this.

00:50:33.115 --> 00:50:47.065
Okay, so just as I was saying earlier, if we look at our input data, and we think of
this as having N training examples in our current batch, where each example has dimension D,

00:50:47.065 --> 00:50:56.063
we're going to compute the empirical mean and variance
independently for each dimension, so basically each feature element,

00:50:56.063 --> 00:51:02.406
and we compute this across our batch, our current
mini-batch that we have and we normalize by this.

00:51:05.786 --> 00:51:09.988
And so this is usually inserted after
fully connected or convolutional layers.

00:51:09.988 --> 00:51:18.932
We saw that we were multiplying by W in these layers, which we do over
and over again, so we can have this bad scaling effect with each one.

00:51:18.932 --> 00:51:22.731
And so this basically is
able to undo this effect.

00:51:22.731 --> 00:51:37.132
Right, and since we're basically just scaling by the inputs connected to each neuron, each activation,
we can apply this the same way to fully connected and convolutional layers, and the only difference is that,

00:51:37.132 --> 00:51:45.895
with convolutional layers, we want to normalize not just across all the
training examples, and independently for each feature dimension,

00:51:45.895 --> 00:51:58.895
but we actually want to normalize jointly across both all the feature dimensions, all the
spatial locations that we have in our activation map, as well as all of the training examples.

00:51:58.895 --> 00:52:05.903
And we do this, because we want to obey the convolutional property,
and we want nearby locations to be normalized the same way, right?

00:52:05.903 --> 00:52:13.489
And so with a convolutional layer, we're basically going to have one
mean and one standard deviation per activation map that we have,

00:52:13.489 --> 00:52:18.094
and we're going to normalize by this
across all of the examples in the batch.

00:52:18.094 --> 00:52:23.098
And so this is something that you guys are
going to implement in your next homework.

00:52:23.098 --> 00:52:29.367
And so, all of these details are explained
very clearly in this paper from 2015.

00:52:29.367 --> 00:52:35.621
And so this is a very useful technique
that you'll want to use a lot in practice.

00:52:35.621 --> 00:52:46.129
You want to have these batch normalization layers. And so you should read this
paper. Go through all of the derivations, and then also go through the derivations

00:52:46.129 --> 00:52:53.718
of how to compute the gradients given
this normalization operation.

00:52:56.626 --> 00:52:59.993
Okay, so one thing that I just
want to point out is that,

00:52:59.993 --> 00:53:05.930
it's not clear that, you know, we're doing this batch
normalization after every fully connected layer,

00:53:05.930 --> 00:53:12.031
and it's not clear that we necessarily want a
unit gaussian input to these tanh nonlinearities,

00:53:12.031 --> 00:53:17.107
because what this is doing is this is constraining
you to the linear regime of this nonlinearity,

00:53:17.107 --> 00:53:21.974
and you're basically trying to say,
let's not have any of this saturation,

00:53:21.974 --> 00:53:30.821
but maybe a little bit of this is good, right? You want to be
able to control how much saturation you want to have.

00:53:31.845 --> 00:53:39.512
And so what, the way that we address this when we're doing batch
normalization is that we have our normalization operation,

00:53:39.512 --> 00:53:44.453
but then after that we have this additional
scaling and shifting operation.

00:53:44.453 --> 00:53:52.515
So, we do our normalization. Then we're going to scale by some
constant gamma, and then shift by another factor of beta.

00:53:53.349 --> 00:54:02.071
Right, and so what this actually does is that this allows you
to be able to recover the identity function if you wanted to.

00:54:02.071 --> 00:54:10.613
So, if the network wanted to, it could learn the scaling factor gamma to
be just your standard deviation, and it could learn beta to be your mean,

00:54:10.613 --> 00:54:16.659
and in this case you can recover the identity
mapping, as if you didn't have batch normalization.

00:54:16.659 --> 00:54:32.225
And so now you have the flexibility of doing everything in between, letting the network learn
how to make your tanh more or less saturated, and how much to do so, in order to have good training.

00:54:38.166 --> 00:54:42.285
Okay, so just to sort of summarize
the batch normalization idea.

00:54:42.285 --> 00:54:52.906
Right, so given our inputs, we're going to compute our mini-batch mean. So,
we do this for every mini-batch that's coming in. We compute our variance.

00:54:52.906 --> 00:54:58.342
We normalize by the mean and variance, and we
have this additional scaling and shifting factor.
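
A minimal sketch of the batch normalization forward pass just summarized (training-time behavior only; variable names are illustrative):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (N, D) mini-batch; mean/variance computed per feature over the batch."""
    mu = x.mean(axis=0)                    # mini-batch mean
    var = x.var(axis=0)                    # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize: zero mean, unit variance
    return gamma * x_hat + beta            # learned scale and shift

np.random.seed(0)
x = 3.0 + 2.0 * np.random.randn(8, 4)

# With gamma=1, beta=0, the outputs are unit gaussian per feature.
out = batchnorm_forward(x, np.ones(4), np.zeros(4))

# If the network learned gamma = std and beta = mean, it would recover
# the identity mapping, as discussed a moment ago.
ident = batchnorm_forward(x, np.sqrt(x.var(axis=0) + 1e-5), x.mean(axis=0))
```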

00:54:58.342 --> 00:55:05.484
And so this improves gradient flow through the
network. It's also more robust as a result.

00:55:05.484 --> 00:55:10.562
It works for a wider range of learning rates
and different kinds of initialization,

00:55:10.562 --> 00:55:16.955
so people have seen that once you put batch normalization in,
it's just easier to train, and so that's why you should do this.

00:55:16.955 --> 00:55:27.162
And then also, one thing that I just want to point out is that you
can also think of this as, in a way, also doing some regularization.

00:55:27.162 --> 00:55:42.733
Right, and so, because now the output of each layer, each of these activations, is a function
of both your input X and the other examples in the batch that it happens to be sampled with, right,

00:55:42.733 --> 00:55:48.266
because you're going to normalize each input
data by the empirical mean over that batch.

00:55:48.266 --> 00:55:54.021
So because of that, it's no longer producing
deterministic values for a given training example,

00:55:54.021 --> 00:55:57.543
and it's tying all of these
inputs in a batch together.

00:55:57.543 --> 00:56:07.215
And so this basically, because it's no longer deterministic, kind of jitters your
representation of X a little bit, and in a sense, gives some sort of regularization effect.

00:56:08.941 --> 00:56:10.490
Yeah, question?

00:56:10.490 --> 00:56:13.401
[student speaking off camera]

00:56:13.401 --> 00:56:17.354
The question is gamma and beta are learned
parameters, and yes that's the case.

00:56:17.354 --> 00:56:20.937
[student speaking off mic]

00:56:27.754 --> 00:56:34.618
Yeah, so the question is why do we want to learn this gamma
and beta to be able to learn the identity function back,

00:56:34.618 --> 00:56:38.481
and the reason is because
you want to give it the flexibility.

00:56:38.481 --> 00:56:48.381
Right, so what batch normalization is doing, is it's forcing our
data to become this unit gaussian, our inputs to be unit gaussian,

00:56:48.381 --> 00:56:54.232
but even though in general this is a good idea, it's
not always that this is exactly the best thing to do.

00:56:54.232 --> 00:57:00.279
And we saw in particular for something like a tanh, you might
want to control some degree of saturation that you have.

00:57:00.279 --> 00:57:14.195
And so what this does is it gives the network the flexibility of doing this exact unit gaussian
normalization if it wants to, but also learning that maybe, in this particular part of the network, that's not the best thing to do.

00:57:14.195 --> 00:57:19.838
Maybe we want something still in this general idea, but
slightly different, right, slightly scaled or shifted.

00:57:19.838 --> 00:57:25.968
And so these parameters just give it that extra
flexibility to learn that if it wants to.

00:57:25.968 --> 00:57:35.665
And then yeah, if the best thing to do is just batch
normalization, then it'll learn the right parameters for that. Yeah?

00:57:35.665 --> 00:57:39.710
[student speaking off mic]

00:57:39.710 --> 00:57:47.079
Yeah, so basically each neuron output. So, we have
the output of a fully connected layer, W times X,

00:57:48.366 --> 00:57:57.365
and so we have the values of each of these outputs, and then we're going
to apply batch normalization separately to each of these neurons.

00:57:57.365 --> 00:57:58.835
Question?

00:57:58.835 --> 00:58:02.418
[student speaking off mic]

00:58:10.031 --> 00:58:17.517
Yeah, so the question is that for things like reinforcement learning,
you might have a really small batch size. How do you deal with this?

00:58:17.517 --> 00:58:24.324
So in practice, I guess batch normalization has been used
a lot for standard convolutional neural networks,

00:58:24.324 --> 00:58:34.520
and there's actually papers on how do we want to do normalization for different kinds of recurrent
networks, or you know some of these networks that might also be in reinforcement learning.

00:58:34.520 --> 00:58:40.532
And so there's different considerations that you might want to
think of there. And this is still an active area of research.

00:58:40.532 --> 00:58:49.490
There's papers on this and we might also talk about some of this more later,
but for a typical convolutional neural network this generally works fine.

00:58:49.490 --> 00:58:57.741
And then if you have a smaller batch size, maybe this becomes a
little bit less accurate, but you still get kind of the same effect.

00:58:57.741 --> 00:59:06.088
And you know it's possible also that you could design your mean
and variance to be computed maybe over more examples, right,

00:59:06.088 --> 00:59:14.755
and I think in practice usually it's just okay, so you don't see this too
much, but this is something that maybe could help if that was a problem.

00:59:14.755 --> 00:59:16.128
Yeah, question?

00:59:16.128 --> 00:59:19.711
[student speaking off mic]

00:59:24.947 --> 00:59:32.979
So the question is, if we force the
inputs to be gaussian, do we lose the structure?

00:59:35.211 --> 00:59:45.221
So no, in the sense that if you had all your features distributed
as a gaussian, for example, even if you were just doing data pre-processing,

00:59:45.221 --> 00:59:47.925
this gaussian is not
losing you any structure.

00:59:47.925 --> 00:59:57.913
It's just shifting and scaling your data into a regime that
works well for the operations that you're going to perform on it.

00:59:57.913 --> 01:00:03.169
In convolutional layers, you do have some structure,
that you want to preserve spatially, right.

01:00:03.169 --> 01:00:09.156
If you look at your activation maps, you want
them all to make sense relative to each other.

01:00:09.156 --> 01:00:17.823
So, in this case you do want to take that into consideration. And so now,
we're going to normalize and find one mean for the entire activation map,

01:00:17.823 --> 01:00:22.815
so we only find the empirical mean
and variance over training examples.

01:00:22.815 --> 01:00:32.455
And so that's something that you'll be doing in your homework, and
also explained in the paper as well. So, you should refer to that.

01:00:32.455 --> 01:00:33.288
Yes.

01:00:34.287 --> 01:00:37.870
[student speaking off mic]

01:00:43.065 --> 01:00:47.849
So the question is, are we normalizing
the weight so that they become gaussian.

01:00:47.849 --> 01:00:49.665
So, if I understand
your question correctly,

01:00:49.665 --> 01:00:58.727
then the answer is, we're normalizing the inputs to each
layer, so we're not changing the weights in this process.

01:01:00.895 --> 01:01:04.562
[student speaking off mic]

01:01:15.208 --> 01:01:24.512
Yeah, so the question is, once we subtract the mean and divide by the
standard deviation, does this become gaussian, and the answer is yes.

01:01:24.512 --> 01:01:33.843
So, if you think about the operations that are happening, basically you're
shifting by the mean, right, and so this shifts it to be zero-centered,

01:01:33.843 --> 01:01:40.243
and then you're scaling by the standard deviation.
This now transforms this into a unit gaussian.

01:01:41.249 --> 01:01:48.630
And so if you want to look more into that, there
are a lot of machine learning explanations

01:01:48.630 --> 01:01:52.942
that go into exactly what this
operation is doing and visualize it,

01:01:52.942 --> 01:01:58.563
but yeah this basically takes your data
and turns it into a gaussian distribution.

01:02:00.458 --> 01:02:02.375
Okay, so yeah question?

01:02:03.436 --> 01:02:07.019
[student speaking off mic]

01:02:08.262 --> 01:02:09.095
Uh-huh.

01:02:26.194 --> 01:02:35.634
So the question is, if we're going to be doing the shift and scale, and learning these
parameters, is batch normalization redundant, because you could recover the identity mapping?

01:02:35.634 --> 01:02:44.523
So in the case that the network learns that the identity mapping is always the best, and
it learns these parameters, then yeah, there would be no point for batch normalization,

01:02:44.523 --> 01:02:52.579
but in practice this doesn't happen. So in practice, we will learn
this gamma and beta that are not the same as an identity mapping.

01:02:52.579 --> 01:02:58.858
So, it will shift and scale by some amount, but not the
amount that's going to give you an identity mapping.

01:02:58.858 --> 01:03:03.201
And so what you get is you still get
this batch normalization effect.

01:03:03.201 --> 01:03:14.266
Right, so having this identity mapping there, I'm only putting this here to say that
in the extreme, it could learn the identity mapping, but in practice it doesn't.

01:03:14.266 --> 01:03:15.970
Yeah, question.

01:03:15.970 --> 01:03:19.553
[student speaking off mic]

01:03:21.368 --> 01:03:22.561
Yeah.

01:03:22.561 --> 01:03:26.144
[student speaking off mic]

01:03:30.825 --> 01:03:37.505
Oh, right, right. Yeah, yeah sorry, I was not clear about this,
but yeah I think this is related to the other question earlier,

01:03:38.972 --> 01:03:49.814
that yeah, when we're doing this, we're actually getting zero mean and unit variance,
which puts this into a nice shape, but it doesn't actually have to be a gaussian.

01:03:49.814 --> 01:03:57.830
So yeah, I mean ideally, if we're looking at like inputs
coming in, as you know, sort of approximately gaussian,

01:03:57.830 --> 01:04:03.592
we would like it to have this kind of effect,
but yeah, in practice it doesn't have to be.

01:04:06.658 --> 01:04:14.017
Okay, so the last thing I just want to mention about
this is that, at test time, for the batch normalization layer,

01:04:17.064 --> 01:04:26.932
we now take the empirical mean and variance from the
training data. So, we don't re-compute this at test time.

01:04:26.932 --> 01:04:38.295
We just estimate this at training time, for example using running averages, and then
we're going to use this at test time. So, we're just going to scale by that.
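A minimal sketch of this test-time behavior, assuming the running estimates were tracked with an exponential moving average during training (the `momentum` value here is illustrative):

```python
import numpy as np

def batchnorm_test(x, gamma, beta, running_mean, running_var, eps=1e-5):
    # Test time: normalize with statistics estimated during training,
    # not with the statistics of the current batch.
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

# During training, the running estimates are typically updated as an
# exponential moving average of each mini-batch's statistics:
momentum = 0.9
running_mean, running_var = np.zeros(50), np.ones(50)
batch = np.random.randn(32, 50)
running_mean = momentum * running_mean + (1 - momentum) * batch.mean(axis=0)
running_var = momentum * running_var + (1 - momentum) * batch.var(axis=0)

# At test time the output for a given input is deterministic,
# so the regularizing batch-to-batch jitter disappears.
out = batchnorm_test(batch, np.ones(50), np.zeros(50), running_mean, running_var)
```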

01:04:40.078 --> 01:04:43.725
Okay, so now I'm going to move on
to babysitting the learning process.

01:04:43.725 --> 01:04:54.264
Right, so now we've defined our network architecture, and we'll talk about
how we monitor training, and how we adjust hyperparameters as we go,

01:04:54.264 --> 01:04:56.681
to get good learning results.

01:04:58.091 --> 01:05:02.251
So as always, the first step we want
to do is pre-process the data.

01:05:02.251 --> 01:05:05.773
Right, so we want to zero mean the data
as we talked about earlier.

01:05:05.773 --> 01:05:13.455
Then we want to choose the architecture, and so here we are
starting with one hidden layer of 50 neurons, for example,

01:05:13.455 --> 01:05:18.950
but basically we can pick any
architecture that we want to start with.

01:05:20.223 --> 01:05:23.934
And then the first thing that we want
to do is we initialize our network.

01:05:23.934 --> 01:05:28.600
We do a forward pass through it, and we want
to make sure that our loss is reasonable.

01:05:28.600 --> 01:05:35.697
So, we talked about this several lectures ago, where,
let's say, we have a Softmax classifier here.

01:05:37.493 --> 01:05:44.012
We know what our loss should be when our weights are small
and we have a generally diffuse distribution over classes.

01:05:44.012 --> 01:05:50.293
The Softmax classifier loss is going
to be your negative log likelihood,

01:05:50.293 --> 01:05:54.826
which if we have 10 classes, it'll be
something like negative log of one over 10,

01:05:54.826 --> 01:06:03.213
which here is around 2.3, and so we want to make
sure that our loss is what we expect it to be.
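This expected-loss check can be written out directly; a minimal sketch, assuming a 10-class softmax classifier:

```python
import numpy as np

num_classes = 10
# With near-zero weights, every class gets roughly equal probability,
# so the expected softmax loss is -log(1/num_classes).
expected = -np.log(1.0 / num_classes)   # about 2.3 for 10 classes

# A quick numerical check with tiny random scores standing in for the
# scores a freshly initialized network would produce:
scores = 0.001 * np.random.randn(100, num_classes)   # near-uniform logits
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
y = np.random.randint(num_classes, size=100)         # arbitrary labels
loss = -np.log(probs[np.arange(100), y]).mean()
# loss should come out very close to the expected 2.3
```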

01:06:03.213 --> 01:06:09.453
So, this is a good sanity check
that we want to always, always do.

01:06:09.453 --> 01:06:13.503
So, now once we've seen that our
original loss is good, now we want to,

01:06:14.853 --> 01:06:25.463
so first we want to do this with zero regularization, right. So, when we disable the
regularization, our only loss term is this data loss, which is going to give 2.3 here.

01:06:25.463 --> 01:06:36.226
And so here, now we want to crank up the regularization, and when we do that, we want
to see that our loss goes up, because we've added this additional regularization term.

01:06:36.226 --> 01:06:40.879
So, this is a good next step
that you can do for your sanity check.

01:06:40.879 --> 01:06:46.309
And then, now we can start training.
So, now we start trying to train.

01:06:47.331 --> 01:06:53.026
A good way to do this is to
start with a very small amount of data,

01:06:53.026 --> 01:07:00.944
because if you have just a very small training set, you should be able
to overfit this very well and get a very good training loss on here.

01:07:00.944 --> 01:07:10.697
And so in this case we want to turn off our regularization
again, and just see if we can make the loss go down to zero.

01:07:12.199 --> 01:07:21.961
And so we can see how our loss is changing, as we have all these epochs. We compute
our loss at each epoch, and we want to see this go all the way down to zero.

01:07:21.961 --> 01:07:27.124
Right, and here we can see that also our training accuracy
is going all the way up to one, and this makes sense right.

01:07:27.124 --> 01:07:32.813
If you have a very small amount of data, you
should be able to overfit it perfectly.
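This overfit-a-small-subset check can be sketched on a tiny synthetic dataset (the two-layer network, data, sizes, and learning rate here are all illustrative, not the lecture's actual setup):

```python
import numpy as np

# Tiny "dataset" of 20 examples with arbitrary labels.
rng = np.random.default_rng(0)
N, D, H, C = 20, 10, 100, 3
X = rng.standard_normal((N, D))
y = rng.integers(C, size=N)

W1 = 0.1 * rng.standard_normal((D, H)); b1 = np.zeros(H)
W2 = 0.1 * rng.standard_normal((H, C)); b2 = np.zeros(C)

lr = 0.5
for _ in range(3000):
    # Forward: two-layer ReLU net, softmax loss, regularization turned off.
    h = np.maximum(0, X @ W1 + b1)
    scores = h @ W2 + b2
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(N), y]).mean()

    # Backward: standard softmax and ReLU gradients.
    dscores = probs.copy()
    dscores[np.arange(N), y] -= 1
    dscores /= N
    dW2 = h.T @ dscores; db2 = dscores.sum(axis=0)
    dh = dscores @ W2.T
    dh[h <= 0] = 0
    dW1 = X.T @ dh; db1 = dh.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

train_acc = float((probs.argmax(axis=1) == y).mean())
# the loss should drop far below its initial value of about log(3) ~ 1.1;
# if it doesn't, something in the implementation is broken
```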

01:07:34.726 --> 01:07:40.366
Okay, so now once you've done that, these are all sanity
checks. Now you can start really trying to train.

01:07:40.366 --> 01:07:49.480
So, now you can take your full training data, and now start with a small amount
of regularization, and let's first figure out what's a good learning rate.

01:07:49.480 --> 01:07:54.942
So, learning rate is one of the most important hyperparameters,
and it's something that you want to adjust first.

01:07:54.942 --> 01:08:00.954
So, you want to try some value of learning
rate, and here I've tried 1e-6,

01:08:00.954 --> 01:08:04.096
and you can see that the
loss is barely changing.

01:08:04.096 --> 01:08:10.244
Right, and so the reason this is barely changing is
usually because your learning rate is too small.

01:08:10.244 --> 01:08:16.362
So when it's too small, your gradient updates are not
big enough, and your cost is basically about the same.

01:08:17.423 --> 01:08:29.806
Okay, so one thing that I want to point out here is that, even though our loss
was barely changing, the training and the validation accuracy jumped up to 20% very quickly.

01:08:32.701 --> 01:08:38.152
And so does anyone have any idea
for why this might be the case?

01:08:40.089 --> 01:08:46.403
Why, so remember we have a Softmax function, and our loss
didn't really change, but our accuracy improved a lot.

01:08:50.263 --> 01:08:59.727
Okay, so the reason for this is that here the probabilities are
still pretty diffuse, so our loss term is still pretty similar,

01:08:59.727 --> 01:09:06.183
but when we shift all of these probabilities slightly
in the right direction, because we're learning, right?

01:09:06.183 --> 01:09:11.954
Our weights are changing the right direction.
Now the accuracy all of a sudden can jump,

01:09:11.954 --> 01:09:21.985
because we're taking the class with the maximum score, and so we're going to get
a big jump in accuracy, even though our probabilities are still relatively diffuse.

01:09:23.588 --> 01:09:31.325
Okay, so now if we try another learning rate, here I'm jumping
to the other extreme, picking a very big learning rate, 1e6.

01:09:31.326 --> 01:09:41.413
What's happening is that our cost is now giving us NaNs. And, when you
have NaNs, what this usually means is that basically your cost exploded.

01:09:41.413 --> 01:09:47.862
And so, the reason for that is typically
that your learning rate was too high.

01:09:49.350 --> 01:09:57.006
So, then you can adjust your learning rate down again. Here
we're trying 3e-3, and the cost is still exploding.

01:09:57.006 --> 01:10:04.901
So, usually the rough range for learning rates that
we want to look at is between 1e-3 and 1e-5.

01:10:04.901 --> 01:10:09.628
And, this is the rough range that we
want to be cross-validating in between.

01:10:09.628 --> 01:10:19.011
So, you want to try out values in this range, and depending on whether your
loss is changing too slowly or is blowing up, adjust it accordingly.

01:10:21.228 --> 01:10:24.399
And so how exactly do we
pick these hyperparameters?

01:10:24.399 --> 01:10:31.139
How do we do hyperparameter optimization and pick the
best values of all of these hyperparameters?

01:10:31.139 --> 01:10:37.575
So, the strategy that we're going to use for any hyperparameter,
for example learning rate, is to do cross-validation.

01:10:37.575 --> 01:10:43.472
So, cross-validation is training on your training
set, and then evaluating on a validation set.

01:10:43.472 --> 01:10:48.960
How well does this hyperparameter do? This is something
that you guys have already done in your assignment.

01:10:48.960 --> 01:10:51.334
And so typically we want
to do this in stages.

01:10:51.334 --> 01:11:03.473
And so, we can first do a coarse stage, where we pick values spread pretty far apart, and then we
learn for only a few epochs. And with only a few epochs, you can already get a pretty good sense

01:11:03.473 --> 01:11:07.993
of which hyperparameter
values are good or not, right.

01:11:07.993 --> 01:11:13.712
You can quickly see that it's a NaN, or you can see that
nothing is happening, and you can adjust accordingly.

01:11:13.712 --> 01:11:22.540
So, typically once you do that, you can see what's a pretty good
range, the range that you now want to do finer sampling of values in.

01:11:22.540 --> 01:11:30.779
And so, this is the second stage, where now you might want to run
this for a longer time, and do a finer search over that region.

01:11:30.779 --> 01:11:47.296
And one tip for detecting explosions like NaNs: in your training loop, you can sample some
hyperparameter, start training, and then look at your cost at every iteration or every epoch.

01:11:47.296 --> 01:11:57.902
And if you ever get a cost that's much larger than your original cost, for example something
like three times the original cost, then you know that this is not heading in the right direction.

01:11:57.902 --> 01:12:06.335
Right, it's getting very big very quickly, and you can just break out of
your loop, stop with this hyperparameter choice, and pick something else.
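This break-out tip can be sketched as a simple guard (a hypothetical helper; `loss_stream` stands in for the per-iteration costs a real training loop would produce):

```python
import math

def train_with_explosion_check(initial_loss, loss_stream, factor=3.0):
    # Walk through per-iteration losses; abandon this hyperparameter
    # setting as soon as the cost is NaN or blows past `factor` times
    # the original cost.
    for i, loss in enumerate(loss_stream):
        if math.isnan(loss) or loss > factor * initial_loss:
            return i, 'exploded'    # break out, try other hyperparameters
    return len(loss_stream), 'finished'

# e.g. a run whose cost blows past 3x the initial 2.3 on iteration 3:
steps, status = train_with_explosion_check(2.3, [2.2, 2.4, 5.0, 8.0, float('nan')])
```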

01:12:06.335 --> 01:12:12.496
Alright, so as an example of this, let's say here
we want to run a coarse search for five epochs.

01:12:13.866 --> 01:12:24.611
This is a similar network to what we were talking about earlier, and what we
can do is look at all of these validation accuracies that we're getting.

01:12:24.611 --> 01:12:29.291
And I've highlighted in red
the ones that give better values.

01:12:29.291 --> 01:12:33.092
And so these are going to be regions that
we're going to look into in more detail.

01:12:33.092 --> 01:12:37.067
And one thing to note is that it's
usually better to optimize in log space.

01:12:37.067 --> 01:12:49.040
And so here, instead of sampling, say, uniformly between 0.01 and 100, you're
going to actually sample 10 to the power of some uniformly distributed exponent.
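This log-space sampling can be sketched as follows (the exponent range here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Sample the exponent uniformly, then take 10 to that power, so candidate
# learning rates are spread evenly across orders of magnitude.
lrs = 10 ** rng.uniform(-5, -1, size=8)

# Sampling uniformly in [1e-5, 1e-1] would instead put almost all samples
# near 1e-1, wasting most of the search budget on one order of magnitude.
uniform = rng.uniform(1e-5, 1e-1, size=8)
```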

01:12:49.956 --> 01:12:55.427
Right, and this is because the learning
rate is multiplying your gradient update.

01:12:55.427 --> 01:13:07.524
And so it has these multiplicative effects, and so it makes more sense to consider a range of
learning rates that are multiplied or divided by some value, rather than uniformly sampled.

01:13:07.524 --> 01:13:10.894
So, you want to be dealing
with orders of magnitude here.

01:13:10.894 --> 01:13:14.379
Okay, so once you find that,
you can then adjust your range.

01:13:14.379 --> 01:13:26.176
Right, in this case, we have a range of maybe 10 to the negative four
to 10 to the zero power. This is a good range that we want to narrow down into.

01:13:26.176 --> 01:13:37.962
And so we can do this again, and here we can see that we're getting a relatively
good accuracy of 53%. And so this means we're headed in the right direction.

01:13:37.962 --> 01:13:42.377
The one thing that I want to point out
is that here we actually have a problem.

01:13:42.377 --> 01:13:50.396
And so the problem is that we can see that our best
accuracy here has a learning rate that's about,

01:13:52.373 --> 01:13:57.816
you know, all of our good learning rates
are in this 10 to the negative four range.

01:13:57.816 --> 01:14:10.273
Right, and since the learning rate that we specified was going from 10 to the negative four to 10 to the
zero, that means that all the good learning rates, were at the edge of the range that we were sampling.

01:14:10.273 --> 01:14:11.856
And so this is bad,

01:14:12.693 --> 01:14:17.113
because this means that we might not have
explored our space sufficiently, right.

01:14:17.113 --> 01:14:20.485
We might actually want to go to 10 to the
negative five, or 10 to the negative six.

01:14:20.485 --> 01:14:23.494
There might be still better ranges
if we continue shifting down.

01:14:23.494 --> 01:14:32.839
So, you want to make sure that your range has the good values somewhere in the middle,
or somewhere where you get a sense that you've explored your range fully.

01:14:36.224 --> 01:14:43.741
Okay, and so another thing is that we can sample all of our
different hyperparameters, using a kind of grid search, right.

01:14:43.741 --> 01:14:49.731
We can sample for a fixed set of combinations,
a fixed set of values for each hyperparameter.

01:14:49.731 --> 01:15:02.334
Sample in a grid manner over all of these values, but in practice it's actually better to
sample from a random layout, sampling a random value of each hyperparameter in a range.

01:15:02.334 --> 01:15:10.876
And so instead, for these two hyperparameters here that we want to
sample from, you'll get samples that look like the right side.
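A small sketch contrasting the two layouts over two hyperparameters (the values and ranges here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Grid search: 3 fixed values per hyperparameter means only 3 distinct
# learning rates are ever tried across all 9 runs.
grid = [(lr, reg) for lr in (1e-4, 1e-3, 1e-2) for reg in (1e-3, 1e-2, 1e-1)]

# Random search: the same 9 runs try 9 distinct values of each
# hyperparameter, giving more information about whichever one matters most.
rand = [(10 ** rng.uniform(-4, -2), 10 ** rng.uniform(-3, -1)) for _ in range(9)]

n_grid_lrs = len({lr for lr, _ in grid})   # only 3 distinct learning rates
n_rand_lrs = len({lr for lr, _ in rand})   # 9 distinct learning rates
```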

01:15:10.876 --> 01:15:19.816
And the reason for this is that if a function is really sort of more
of a function of one variable than another, which is usually true.

01:15:19.816 --> 01:15:24.669
Usually the function has a lower effective dimensionality
than the number of hyperparameters we actually have.

01:15:24.669 --> 01:15:30.342
Then you're going to get many more samples
of the important variable that you have.

01:15:30.342 --> 01:15:38.326
You're going to be able to see this shape in this green function
that I've drawn on top, showing where the good values are,

01:15:38.326 --> 01:15:46.459
compared to if you just did a grid layout, where we were only able to
sample three values here, and you'd have missed where the good regions were.

01:15:46.459 --> 01:15:55.685
Right, and so basically we'll get much more useful signal overall since
we have more samples of different values of the important variable.

01:15:55.685 --> 01:16:00.427
And so, hyperparameters to play with,
we've talked about learning rate,

01:16:00.427 --> 01:16:07.697
things like different types of decay schedules, update
types, regularization, also your network architecture,

01:16:07.697 --> 01:16:12.405
so the number of hidden units, the depth, all of these
are hyperparameters that you can optimize over.

01:16:12.405 --> 01:16:16.928
And we've talked about some of these, but we'll keep
talking about more of these in the next lecture.

01:16:16.928 --> 01:16:24.781
And so you can think of this as basically
tuning all the knobs, right, of some turntable, where

01:16:26.667 --> 01:16:32.260
you're a neural networks practitioner. You can think of the
music that's output as the loss function that you want,

01:16:32.260 --> 01:16:36.313
and you want to adjust everything appropriately
to get the kind of output that you want.

01:16:36.313 --> 01:16:40.480
Alright, so it's really kind
of an art that you're doing.

01:16:42.194 --> 01:16:50.277
And in practice, you're going to do a lot of
hyperparameter optimization, a lot of cross validation.

01:16:50.277 --> 01:17:00.368
And so you know, in order to get numbers, people will run cross validation over tons of
hyperparameters, monitor all of them, see which ones are doing better, which ones are doing worse.

01:17:00.368 --> 01:17:07.895
Here we have all these loss curves. Pick the right
ones, readjust, and keep going through this process.

01:17:07.895 --> 01:17:14.380
And so as I mentioned earlier, as you're monitoring each
of these loss curves, learning rate is an important one,

01:17:15.311 --> 01:17:20.654
but you'll get a sense for which
learning rates are good and bad.

01:17:20.654 --> 01:17:34.060
So you'll see that if you have a very high, exploding one, right, where your loss explodes, then your learning
rate is too high. If it's kind of linear and too flat, you'll see that it's too low; it's not changing enough.

01:17:34.060 --> 01:17:41.660
And if you get something that looks like there's a steep change, but
then a plateau, this is also an indicator of it being maybe too high,

01:17:41.660 --> 01:17:48.460
because in this case, you're taking too large jumps, and
you're not able to settle well into your local optimum.

01:17:48.460 --> 01:17:53.572
And so a good learning rate usually ends up looking something
like this, where you have a relatively steep curve,

01:17:53.572 --> 01:17:57.993
but then it's continuing to go down, and then you
might keep adjusting your learning rate from there.

01:17:57.993 --> 01:18:02.160
And so this is something that
you'll see through practice.

01:18:03.522 --> 01:18:12.637
Okay, and I think we're very close to the end, so just one last thing that
I want to point out is about a particular kind of loss curve.

01:18:12.637 --> 01:18:23.567
So if you ever see loss curves that are flat for a while and then
start training all of a sudden, a potential reason could be bad initialization.

01:18:23.567 --> 01:18:36.383
So in this case, your gradients are not really flowing well in the beginning, so nothing's really learning, and then
at some point, it just happens to adjust in the right way, such that it tips over and things just start training, right?

01:18:36.383 --> 01:18:47.901
And so there's a lot of experience at looking at these and seeing what's wrong that
you'll gain over time. And so you'll usually want to monitor and visualize your accuracy.

01:18:48.826 --> 01:18:54.860
If you have a big gap between your training
accuracy and your validation accuracy,

01:18:54.860 --> 01:18:59.652
it usually means that you might be overfitting, and you
might want to increase your regularization strength.

01:18:59.652 --> 01:19:08.137
If you have no gap, you might want to increase your model capacity,
because you haven't overfit yet. You could potentially increase it more.

01:19:08.137 --> 01:19:13.998
And in general, we also want to track the updates, the
ratio of our weight updates to our weight magnitudes.

01:19:13.998 --> 01:19:21.428
We can just take the norm of our parameters that
we have to get a sense for how large they are,

01:19:21.428 --> 01:19:26.353
and when we have our update size, we can also take
the norm of that, get a sense for how large that is,

01:19:26.353 --> 01:19:30.025
and we want this ratio to
be somewhere around 0.001.
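This ratio check can be sketched like this (the parameter shapes, scales, and the stand-in gradient are illustrative):

```python
import numpy as np

def update_ratio(W, dW, lr):
    # Ratio of the update magnitude to the parameter magnitude;
    # a value somewhere around 1e-3 is a reasonable rough target.
    update = -lr * dW
    return float(np.linalg.norm(update) / np.linalg.norm(W))

W = 0.01 * np.random.randn(500, 500)    # current parameter values
dW = 0.01 * np.random.randn(500, 500)   # stand-in for a real gradient
ratio = update_ratio(W, dW, lr=1e-3)
# here the gradient norm is about equal to the weight norm, so the
# ratio comes out near the learning rate, roughly 1e-3
```

If the ratio is much larger, the updates dominate the weights; much smaller, and they have almost no effect.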

01:19:30.025 --> 01:19:35.598
There's a lot of variance in this range,
so you don't have to be exactly on this,

01:19:35.598 --> 01:19:41.477
but it's just this sense of you don't want your updates to
be too large compared to your value or too small, right?

01:19:41.477 --> 01:19:43.637
You don't want them to dominate
or to have no effect.

01:19:43.637 --> 01:19:47.811
And so this is just something that can
help debug what might be a problem.

01:19:49.843 --> 01:19:59.016
Okay, so in summary, today we've looked at activation functions, data
preprocessing, weight initialization, batch norm, babysitting the learning process,

01:19:59.016 --> 01:20:01.694
and hyperparameter optimization.

01:20:01.694 --> 01:20:05.338
These are kind of the takeaways for
each that you guys should keep in mind.

01:20:05.338 --> 01:20:08.491
Use ReLUs, subtract the mean,
use Xavier Initialization,

01:20:08.491 --> 01:20:12.499
use batch norm, and sample
hyperparameters randomly.

01:20:12.499 --> 01:20:19.355
And next time we'll continue to talk about training
neural networks, with more of these different topics.